Hello,
today we’re going to have a look at the first component of my data collection pipeline: the browser extension. If you want to peek into the code, you can find it here.
Intro
First, some background: what I needed was a browser extension that runs on as many browsers as possible without too much hassle in supporting all of them. Luckily, Chrome (and all other Chromium-based browsers like Opera, Edge or Brave) and Firefox basically share the same API, the WebExtensions API. I won’t go into much detail here, so if you’re not familiar with what browser extensions look like and what they can do, check out this awesome post:
The resulting manifest.json looks like this:
{
  "manifest_version": 2,
  "name": "Twitter Tracking Extension",
  "version": "$VERSION",
  "description": "A browser extension that captures Twitter clicks - for scientific reasons.",
  "homepage_url": "https://github.com/alxgrk/twitter-tracking",
  "browser_specific_settings": {
    "gecko": {
      "id": "twittertracking@university.de",
      "strict_min_version": "48.0",
      "update_url": "https://github.com/alxgrk/twitter-tracking/blob/master/twitter-tracking-browser-extension/firefox-addon-update.json"
    }
  },
  "permissions": [
    "storage",
    "unlimitedStorage",
    "$ACCESS_SITE_PERMISSION"
  ],
  "icons": {
    "48": "icons/icon-48.png",
    "96": "icons/icon-96.png"
  },
  "browser_action": {
    "default_icon": {
      "32": "icons/icon-32.png"
    },
    "default_title": "Twitter Tracking",
    "default_popup": "popup/tracked-events.html"
  },
  "content_scripts": [
    {
      "matches": ["*://*.twitter.com/*"],
      "js": ["js/main.js"]
    }
  ]
}
You may have noticed the $... placeholders - they are going to be replaced by webpack. We’ll see how in the next section.
The Building Process
First, let’s have a look at how the final artifacts are built. I’m using webpack’s copy-webpack-plugin and webpack.DefinePlugin to differentiate between dev and prod builds. The CopyWebpackPlugin enables you to move third-party dependencies where needed or to replace placeholders in files like the manifest.json. The following snippet shows how to set the version and access site permission, and how to bundle the webextension-polyfill necessary to ensure interoperability between Chrome and Firefox for production:
// at the top of webpack.config.js
const path = require('path');
const CopyWebpackPlugin = require('copy-webpack-plugin');
const dotenv = require('dotenv').config();
const distRoot = path.resolve(__dirname, 'dist'); // assumption: artifacts end up in ./dist

new CopyWebpackPlugin({
  patterns: [
    {
      from: 'src',
      globOptions: {
        ignore: ['**/*.js'],
      },
      // replace the placeholders in manifest.json, copy all other files as-is
      transform(content, absoluteFrom) {
        if (absoluteFrom.includes("manifest.json")) {
          return content.toString()
            .replace("$VERSION", process.env.npm_package_version)
            .replace("$ACCESS_SITE_PERMISSION", dotenv.parsed.PROD_API_URL);
        } else {
          return content;
        }
      }
    }, {
      // bundle the polyfill next to the extension's other scripts
      from: 'node_modules/webextension-polyfill/dist/browser-polyfill.js',
      to: path.resolve(distRoot, 'js')
    }
  ]
})
To set environment variables like the endpoint of the REST API, the DefinePlugin is used as follows:
new webpack.DefinePlugin({
  'process.env.NODE_ENV': '"production"',
  API: JSON.stringify(dotenv.parsed.PROD_API_URL)
})
As you may have noticed, the value for API comes from a variable named dotenv, which is also the name of the library that reads a .env file from your current working directory and provides its values as JS objects. To optimize the production artifact, I also added the TerserPlugin like this:
// at the top of webpack.config.js
const TerserPlugin = require('terser-webpack-plugin');

optimization: {
  minimize: true,
  minimizer: [new TerserPlugin()],
}
Since we now have everything ready to build our artifacts, we need to make sure that Chrome, Firefox, etc. permit their installation. To achieve this, we need two different tools: web-ext for Firefox and chromium for Chrome. Check out this script for an example of how to use them. Also, make sure to follow the instructions on publishing extensions for Firefox and for Chrome.
The Collection Process
Once the user has the extension installed, the content script located at src/main.js runs as soon as the user visits Twitter. And now the fun part begins.
What I did was register some listeners for the standard browser events load, beforeunload, click and scroll, as sketched below. If you remember the model from the previous post, we now need to deduce from these events the temporal dimension and the kind of user interactions we have.
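A minimal sketch of that registration - the handler names are illustrative placeholders of mine, not necessarily the names used in the repo:

// src/main.js - handler names are illustrative placeholders
function onStart(event) { /* build & publish the start event */ }
function onEnd(event) { /* build & publish the end event */ }
function onClick(event) { /* map event.target to a click type */ }
function onScroll(event) { /* capture the current scroll position */ }

window.addEventListener("load", onStart);
window.addEventListener("beforeunload", onEnd);
document.addEventListener("click", onClick);
document.addEventListener("scroll", onScroll);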
load and beforeunload give us the start and end events, which are published to the backend and additionally stored in local storage so they can be displayed to the user later. While constructing the events, we also have to create some kind of identifier for the user, ideally a reproducible one if we want to relate it to his/her Android app interactions later. To achieve this, the username has to be found by inspecting the alt attribute of the user’s profile image. And since we care (at least a little) about privacy, a consistent and “secure” hashing method like SHA-256 is applied to it.
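As a sketch, such a hash can be computed with the browser’s built-in Web Crypto API - the image selector below is my guess, not necessarily the one the extension uses:

// hash the username with SHA-256 via the Web Crypto API
async function userId() {
  // the selector is an illustrative guess - the real code inspects
  // the alt attribute of the profile image
  const img = document.querySelector('img[alt][src*="profile_images"]');
  const username = img ? img.alt : "unknown";
  const data = new TextEncoder().encode(username);
  const digest = await crypto.subtle.digest("SHA-256", data);
  // convert the resulting ArrayBuffer to a hex string
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
}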
scroll maps trivially to the scroll events needed. Just use document.documentElement.scrollTop to get the current scroll position.
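For example, a throttled handler could look like the following sketch - the 500 ms interval and the publishEvent helper from the earlier sketch are my own assumptions:

// report the scroll position at most once every 500ms (interval is arbitrary)
let lastReported = 0;
document.addEventListener("scroll", () => {
  const now = Date.now();
  if (now - lastReported > 500) {
    lastReported = now;
    const position = document.documentElement.scrollTop;
    // publishEvent is the hypothetical helper sketched earlier
    publishEvent({ type: "scroll", position, timestamp: now });
  }
});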
Concerning clicks, however, a little more fiddling is required. Unfortunately, Twitter’s HTML tags are almost all divs with randomized classnames. The only anchors are semantic tags like article, rare element IDs and the data-testid attribute. Luckily, with these and a great tool named @medv/finder, I was able to create RegExes that match the found selector to my predefined click types. So for any click event, the finder is run…
import { finder } from "@medv/finder";

let selector = finder(event.target, {
  className: name => false // do not rely on (randomized) classnames
});
…and the result is mapped. All mapping functions are stored in an object whose keys are the names of the click types. For a like click, for example, this is the RegEx:
const TARGETS = {
  like: selector => /#tweet-action-buttons.+:nth-child\(3\).*/.test(selector),
  ...
}
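The lookup over this object can then be as simple as picking the first entry whose RegEx matches - a minimal sketch, assuming the TARGETS object above:

// find the first click type whose RegEx matches the generated selector
function clickTypeFor(selector) {
  const entry = Object.entries(TARGETS)
    .find(([, matches]) => matches(selector));
  return entry ? entry[0] : "unknown";
}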
As you can imagine, this approach is quite error-prone. But that is one of the difficulties of reverse-engineering a constantly changing external system, and I haven’t found an alternative yet. If you know a better way, please leave a comment.
Thanks for reading, and have fun until next time, when the post about the Android app is finished.