Class: UrlPrivacy

Inherits:
Object
Defined in:
lib/url_privacy.rb

Overview

Usage:

UrlPrivacy.clean(url)

Constant Summary

TRACKING_PARAMS =

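Query parameters to strip from URLs. Taken from Neat URL and CleanURLs, plus a few others found manually. Entries without an @ apply to every host; entries of the form param@host apply only when the hostname matches. Both sides may contain glob wildcards (*).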

%w[pf_rd_*@imdb.com [email protected] gclid ref
terminal_id igshid tracking_id action_object_map action_type_map
action_ref_map spm@*.aliexpress.com scm@*.aliexpress.com
aff_platform aff_trace_key algo_expid@*.aliexpress.*
algo_pvid@*.aliexpress.* btsid ws_ab_test pd_rd_*@amazon.*
_encoding@amazon.* psc@amazon.* tag@amazon.* ref_@amazon.*
pf_rd_*@amazon.* pf@amazon.* qid@amazon.* sr@amazon.*
srs@amazon.* __mk_*@amazon.* spIA@amazon.* ms3_c@amazon.*
ie*@amazon.* refRID@amazon.* colid@amazon.* coliid@amazon.*
*adId@amazon.* qualifier@amazon.* _encoding@amazon.*
smid@amazon.* field-lbr_brands_browse-bin@amazon.* ved@google.*
bi*@google.* gfe_*@google.* ei@google.* source@google.*
gs_*@google.* site@google.* oq@google.* esrc@google.*
uact@google.* cd@google.* cad@google.* gws_*@google.*
atyp@google.* vet@google.* zx@google.* _u@google.* je@google.*
dcr@google.* ie@google.* sei@google.* sa@google.* dpr@google.*
hl@google.* btn*@google.* sa@google.* usg@google.* cd@google.*
cad@google.* uact@google.* [email protected]
[email protected] [email protected] [email protected]
[email protected] [email protected] [email protected] [email protected]
[email protected]* ath*@walmart.com* utm_* ga_source ga_medium
ga_term ga_content ga_campaign ga_place yclid _openstat
fb_action_ids fb_action_types fb_source fb_ref fbclid
action_object_map action_type_map action_ref_map gs_l mkt_tok
hmb_campaign hmb_medium hmb_source ref ref_ ref_*@twitter.com
[email protected] trackId@netflix.* tctx@netflix.* jb*@netflix.*
[email protected] [email protected] [email protected]
[email protected] guce_referrer_*@techcrunch.com
[email protected] [email protected] [email protected] [email protected] [email protected]
[email protected] [email protected] [email protected] [email protected]
tt_medium@twitch.* tt_content@twitch.* [email protected]
[email protected] [email protected] [email protected]
*[email protected] [email protected] [email protected] [email protected] [email protected]
[email protected] [email protected] [email protected] [email protected] _trkparms@ebay.*
_trksid@ebay.* _from@ebay.* [email protected] [email protected]
[email protected] [email protected] [email protected] [email protected] [email protected]
[email protected] [email protected] mkt_tok trk trkCampaign ga_* gclid
gclsrc hmb_campaign hmb_medium hmb_source spReportId spJobID
spUserID spMailingID itm_* s_cid elqTrackId elqTrack assetType
assetId recipientId campaignId siteId mc_cid mc_eid pk_*
sc_campaign sc_channel sc_content sc_medium sc_outcome sc_geo
sc_country utm_* nr_email_referer vero_conv vero_id yclid
_openstat mbid cmpid cid c_id campaign_id Campaign hash@ebay.*
fb_action_ids fb_action_types fb_ref fb_source fbclid
[email protected] [email protected] gs_l gs_lcp@google.*
ved@google.* ei@google.* sei@google.* gws_rd@google.*
gs_gbg@google.* gs_mss@google.* gs_rn@google.* _hsenc _hsmi
__hssc __hstc hsCtaTracking [email protected]
[email protected] tt_medium tt_content lr@yandex.*
redircnt@yandex.* [email protected] [email protected] wt_zmc
source@google.* iflsig@google.* sclient@google.*
[email protected] [email protected] [email protected]
[email protected] hc_*@facebook.com *ref*@facebook.com
[email protected] [email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected] [email protected]
[email protected] [email protected] [email protected]
[email protected]].uniq.freeze
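
The clean method refers to simple_tracking_params and complex_tracking_params, whose definitions are not shown on this page. As a rough sketch of how such groups could be derived from entries in the format above, assuming the part after @ is a hostname glob: split_patterns and the short PATTERNS list are hypothetical names used only for illustration, not part of UrlPrivacy.

```ruby
# Hypothetical, shortened pattern list in the same format as TRACKING_PARAMS.
PATTERNS = %w[gclid utm_* ref_@amazon.* pf_rd_*@amazon.* ved@google.*].freeze

# Hypothetical helper (not the library's code): separates host-independent
# entries from "param@host" entries and groups the latter by host glob.
def split_patterns(patterns)
  scoped, global = patterns.partition { |entry| entry.include?('@') }

  by_host = scoped.each_with_object(Hash.new { |h, k| h[k] = [] }) do |entry, acc|
    param, host = entry.split('@', 2)
    acc[host] << param
  end

  [global, by_host]
end

global, by_host = split_patterns(PATTERNS)
# global  => ["gclid", "utm_*"]
# by_host => {"amazon.*"=>["ref_", "pf_rd_*"], "google.*"=>["ved"]}
```

Grouping by host glob up front means each URL needs one hostname match per host pattern rather than one per entry.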

Class Method Summary

Class Method Details

.clean(url) ⇒ String

Strips tracking parameters from the given URL. If the URL can’t be parsed, it is returned unmodified.

Results are cached per input URL, so repeated calls with the same URL are not parsed again.
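
The cache-and-fallback behaviour described above can be sketched in isolation. CachedNormalizer and normalize are hypothetical names, and the body only round-trips the URL through URI instead of stripping anything; the point is the memoization and the rescue path.

```ruby
require 'uri'

# Hypothetical stand-in for the caching pattern (not UrlPrivacy itself).
module CachedNormalizer
  def self.normalize(url)
    @cache ||= {}
    # Parse once per distinct input string and remember the result.
    @cache[url] ||= URI(url).to_s
  rescue URI::Error
    # Unparsable input is cached and returned as-is.
    @cache[url] ||= url
  end
end

first  = CachedNormalizer.normalize('https://example.com/a?b=1')
second = CachedNormalizer.normalize('https://example.com/a?b=1') # served from the cache
broken = CachedNormalizer.normalize('not a url')                 # => "not a url"
```

A cache of this shape never evicts entries, so it grows with the number of distinct URLs seen.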



# File 'lib/url_privacy.rb', line 81

def clean(url)
  @cleaned_urls ||= {}
  @cleaned_urls[url] ||= begin
    uri = URI(url)

    if uri.query && uri.hostname
      hostname = uri.hostname.sub(/\Awww\./, '')
      params = URI.decode_www_form(uri.query).to_h

      # Remove params by name first
      params.reject! do |param, _|
        TRACKING_PARAMS.include? param
      end

      # Remove params with globs
      params.reject! do |param, _|
        simple_tracking_params.any? do |pattern_param|
          File.fnmatch(pattern_param, param)
        end
      end

      # Remove params matching by hostname and then param
      params.reject! do |param, _|
        complex_tracking_params.any? do |pattern_hostname, pattern_params|
          next false unless File.fnmatch(pattern_hostname, hostname)

          pattern_params.any? do |pattern_param|
            File.fnmatch(pattern_param, param)
          end
        end
      end

      # Drop the query entirely when nothing is left, so no bare "?" remains
      uri.query = params.empty? ? nil : URI.encode_www_form(params)
    end

    uri.to_s
  end
rescue URI::Error
  @cleaned_urls[url] ||= url
end
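
The glob checks above rely on File.fnmatch from the Ruby standard library. A minimal sketch of the host-then-param check in the third pass: host_scoped_match? and the rules hash are hypothetical and only illustrate the matching semantics; the hostnames are examples.

```ruby
# Hypothetical helper: true when some host glob matches the hostname and
# one of its parameter globs matches the parameter name.
def host_scoped_match?(rules, hostname, param)
  rules.any? do |host_glob, param_globs|
    File.fnmatch(host_glob, hostname) &&
      param_globs.any? { |glob| File.fnmatch(glob, param) }
  end
end

# Hypothetical rules, shaped like host glob => parameter globs.
rules = { 'amazon.*' => %w[ref_ pf_rd_*], '*.aliexpress.com' => %w[spm] }

host_scoped_match?(rules, 'amazon.co.uk', 'pf_rd_p')   # => true
host_scoped_match?(rules, 'google.com', 'pf_rd_p')     # => false
host_scoped_match?(rules, 'de.aliexpress.com', 'spm')  # => true
host_scoped_match?(rules, 'aliexpress.com', 'spm')     # => false
```

The last case is worth noting: a host glob such as *.aliexpress.com requires something before the dot, and clean removes a leading www. before matching, so such a pattern would apparently not match the bare domain.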