Class: ThreatDetector::Scraper

Inherits:
Object
Extended by:
Enumerable, Forwardable
Includes:
Utility
Defined in:
lib/threat_detector/scraper.rb

Overview

Scrape a given feed URL from ThreatFeeds.io

This class generalizes to all feed URLs. Often, custom settings are required for some feeds, which can be provided as a YAML config file. Each section in this file pertains to a scraper (identified by its name).
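
As an illustrative sketch of that layout, a feed that needs custom handling might carry a `custom` flag under a section named after the scraper. Only the `custom` key is known from #parse below; everything else here is hypothetical:

require 'yaml'

# Hypothetical feeds.yaml section; the top-level key matches the scraper's name.
config = YAML.safe_load(<<~YAML)
  sample_feed:
    custom: true
YAML

config['sample_feed'] # => {"custom"=>true}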

Instance Attribute Summary

Instance Method Summary

Methods included from Utility

#categorize_ip_or_uri, #categorize_uri, #feeds_config_path, #fetch_page, #raise_error, #refresh=, #refresh?, #sanitize_options, #working_directory

Constructor Details

#initialize(options = {}) ⇒ Scraper

Instantiate a new ThreatDetector::Scraper

The default options are:

working_directory: ~/.threat_detector
feeds_config_path: <gem_path>/threat_detector/feeds.yaml

Parameters:

  • options (Hash) (defaults to: {})

    options received from the user


# File 'lib/threat_detector/scraper.rb', line 31

def initialize(options = {})
  @options = sanitize_options(options)
  reset!
end
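
A minimal usage sketch: both default options listed above can be overridden at construction time (the paths below are only examples):

scraper = ThreatDetector::Scraper.new(
  working_directory: '/tmp/threat_detector',       # example path
  feeds_config_path: '/path/to/custom/feeds.yaml'  # example path
)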

Instance Attribute Details

#config ⇒ Object (readonly)

Returns the value of attribute config


# File 'lib/threat_detector/scraper.rb', line 20

def config
  @config
end

#entries ⇒ Object (readonly)

Returns the value of attribute entries


# File 'lib/threat_detector/scraper.rb', line 20

def entries
  @entries
end

#name ⇒ Object

Returns the value of attribute name


# File 'lib/threat_detector/scraper.rb', line 20

def name
  @name
end

#options ⇒ Object (readonly)

Returns the value of attribute options


# File 'lib/threat_detector/scraper.rb', line 20

def options
  @options
end

#reason ⇒ Object (readonly)

Returns the value of attribute reason


# File 'lib/threat_detector/scraper.rb', line 20

def reason
  @reason
end

#url ⇒ Object

Returns the value of attribute url


# File 'lib/threat_detector/scraper.rb', line 20

def url
  @url
end

Instance Method Details

#add_reason(message) ⇒ self

Add a reason to the current scraper instance for skipping a feed.

Returns:

  • (self)

# File 'lib/threat_detector/scraper.rb', line 79

def add_reason(message)
  @reason ||= message
  self
end

#cached? ⇒ Bool

Check whether a cached file exists and the feeds are not being refreshed.

Returns:

  • (Bool)

# File 'lib/threat_detector/scraper.rb', line 73

def cached?
  !refresh? && File.exist?(save_path)
end

#configured? ⇒ Bool

Check whether the scraper has settings defined in the YAML scraping configuration.

Returns:

  • (Bool)

# File 'lib/threat_detector/scraper.rb', line 67

def configured?
  !@config.empty?
end

#for(name, url) ⇒ self

Reset the scraper instance to work on a feed with the provided name and URL. The reset scraper instance is returned to allow method chaining.

Parameters:

  • name (String)

    name for this scraper

  • url (String, URI)

    url for this scraper

Returns:

  • (self)

    reset scraper instance with the feed's name and URL configured


# File 'lib/threat_detector/scraper.rb', line 59

def for(name, url)
  self.url = url
  self.name = name
  self
end
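
Since #for returns the scraper itself, it can be chained directly with the parsing methods below. A minimal sketch (the feed name and URL are illustrative):

scraper = ThreatDetector::Scraper.new
scraper.for('sample_feed', 'https://example.com/feed.txt').parse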

#parse ⇒ self

Note:

If a feed has already been downloaded locally, it will be skipped. To fetch such a feed again, set the `refresh` attribute to `true`.

Scrape and parse the page for the feed with the configured name and URL.

A generalized method (#fetch_entries_via) is used for most feeds, while custom methods defined in this class handle the remaining feeds.

Whenever a feed is ignored or skipped, the scraper instance is updated with the reason for skipping it.

Returns:

  • (self)

    scraper instance with reason for skipping and/or entries after scraping was performed


# File 'lib/threat_detector/scraper.rb', line 98

def parse
  return add_reason('Found cached entries') if cached?

  fetch_page url
  return add_reason('Invalid page response') unless valid_page?

  method = @config['custom'] ? "parse_#{name}" : :fetch_entries
  @entries = send(method)

  empty? ? add_reason('No entries found') : self
rescue Curl::Err::MalformedURLError
  add_reason 'Malformed URL passed'
rescue Curl::Err::TimeoutError
  add_reason 'Timeout received'
end
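
One possible way to inspect the outcome, using the #reason and #entries attributes documented above (the feed name and URL are illustrative):

scraper = ThreatDetector::Scraper.new.for('sample_feed', 'https://example.com/feed.txt')
scraper.parse

if scraper.reason
  puts "Skipped: #{scraper.reason}"
else
  puts "Scraped #{scraper.entries.size} entries"
end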

#parse_and_save_entries {|entries| ... } ⇒ Array<String>

Utility method to scrape a feed and save the entries so obtained.

Yields:

  • (entries)

    hook to work with entries after scraping the current feed

Yield Parameters:

  • entries (Array<String>)

    entries scraped from the current feed

Returns:

  • (Array<String>)

    entries scraped from the current feed

# File 'lib/threat_detector/scraper.rb', line 132

def parse_and_save_entries
  parse
  save_entries
  yield(entries) if block_given?
  entries
end
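
A sketch of the block form, which receives the scraped entries after they have been saved (the feed details are illustrative):

scraper = ThreatDetector::Scraper.new.for('sample_feed', 'https://example.com/feed.txt')
scraper.parse_and_save_entries do |entries|
  # `entries` may be empty if the feed was skipped for some reason
  puts "Fetched #{Array(entries).size} entries"
end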

#save_entries ⇒ String?

Note:

No entries will be saved if the name or URL is not set.

Save entries to local cache files. These files are different from the Trie-based dumps, and are useful for quickly updating/syncing our data from online sources.

Returns:

  • (String, nil)

    path to the file where entries were saved


# File 'lib/threat_detector/scraper.rb', line 120

def save_entries
  return if empty?

  File.open(save_path, 'w') { |f| f.puts @entries }
  save_path
end

#save_path ⇒ String?

Path to the file where entries from the current feed will be saved. Since there can be multiple feeds with the same name, the save path for a feed is suffixed with an MD5 hash substring of its URL.

Returns:

  • (String, nil)

    Path for the file or nil if name or URL is not set


# File 'lib/threat_detector/scraper.rb', line 143

def save_path
  return unless name && url

  path = File.join(working_directory, 'feeds')
  FileUtils.mkdir_p(path) unless File.directory?(path)

  hash = Digest::MD5.hexdigest(url)
  File.join(path, "#{name}-#{hash[0..8]}.txt")
end
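
A hypothetical illustration of the resulting layout; the directory depends on the working_directory option, and the 9-character hash prefix on the URL:

scraper = ThreatDetector::Scraper.new.for('sample_feed', 'https://example.com/feed.txt')
scraper.save_path
# => something like "<working_directory>/feeds/sample_feed-<9-char-md5-prefix>.txt"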