Module: NewsCrawler::CrawlerModule

Included in:
LinkSelector::SameDomainSelector, Processing::StructureAnalysis
Defined in:
lib/news_crawler/crawler_module.rb

Overview

Include this module in a class to get the basic crawler module methods: URL state queries, URL state marking, and YAML persistence.
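
Every method below passes self.class.name to the storage layer, so processing state appears to be tracked separately for each including class. The following is a minimal sketch of that behavior, not the library's implementation. It assumes an in-memory stand-in for URLQueue with assumed constant values; the state_of helper and the TitleExtractor and LinkCounter classes are hypothetical.

```ruby
# Hypothetical in-memory stand-in for the real URLQueue backend.
# State is namespaced by module name; unknown URLs default to unprocessed.
module URLQueue
  UNPROCESSED = 'unprocessed' # assumed value
  PROCESSED   = 'processed'   # assumed value

  @states = Hash.new { |hash, mod| hash[mod] = Hash.new(UNPROCESSED) }

  def self.mark(mod, url, state)
    @states[mod][url] = state
  end

  # Illustration-only helper, not part of the documented API.
  def self.state_of(mod, url)
    @states[mod][url]
  end
end

# Trimmed copy of the mixin, reduced to a single method.
module CrawlerModule
  def mark_processed(url)
    URLQueue.mark(self.class.name, url, URLQueue::PROCESSED)
  end
end

class TitleExtractor
  include CrawlerModule
end

class LinkCounter
  include CrawlerModule
end

TitleExtractor.new.mark_processed('http://example.com/a')

puts URLQueue.state_of('TitleExtractor', 'http://example.com/a') # prints processed
puts URLQueue.state_of('LinkCounter', 'http://example.com/a')    # prints unprocessed
```

Under these assumptions, two classes can work through the same URLs independently, since marking a URL in one class leaves its state untouched in the other.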

Instance Method Details

#find_all(state, max_depth = -1) ⇒ Array

Find all visited URLs that have the given process state for the current module

Parameters:

  • state (String)

    one of unprocessed, processing, processed

  • max_depth (Fixnum) (defaults to: -1)

    maximum depth of the returned URLs (inclusive)

Returns:

  • (Array)

    URL list



# File 'lib/news_crawler/crawler_module.rb', line 53

def find_all(state, max_depth = -1)
  URLQueue.find_all(self.class.name, state, max_depth)
end
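
The max_depth filter can be illustrated with a small sketch. It assumes that a negative max_depth (the default of -1) disables the depth filter, which the documentation above does not state explicitly. The Entry struct and the sample data are hypothetical, and this find_all is a standalone stand-in rather than the module's method.

```ruby
# Hypothetical record type; the real queue keeps URLs in a storage backend.
Entry = Struct.new(:url, :depth, :state)

ENTRIES = [
  Entry.new('http://example.com/',    0, 'unprocessed'),
  Entry.new('http://example.com/a',   1, 'unprocessed'),
  Entry.new('http://example.com/a/b', 2, 'processed'),
  Entry.new('http://example.com/a/c', 2, 'unprocessed')
].freeze

# Assumed semantics: a negative max_depth means "no depth limit";
# otherwise only entries with depth <= max_depth are returned (inclusive).
def find_all(state, max_depth = -1)
  ENTRIES
    .select { |e| e.state == state && (max_depth < 0 || e.depth <= max_depth) }
    .map(&:url)
end

p find_all('unprocessed')    # three URLs, at depths 0, 1 and 2
p find_all('unprocessed', 1) # two URLs, at depths 0 and 1
```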

#find_one(state, max_depth = -1) ⇒ String?

Find one visited URL that has the given process state for the current module

Parameters:

  • state (String)

    one of unprocessed, processing, processed

  • max_depth (Fixnum) (defaults to: -1)

    maximum depth of the returned URL (inclusive)

Returns:

  • (String, nil)

    URL, or nil if no matching URL exists



# File 'lib/news_crawler/crawler_module.rb', line 61

def find_one(state, max_depth = -1)
  URLQueue.find_one(self.class.name, state, max_depth)
end

#find_unprocessed(max_depth = -1) ⇒ Array

Find all visited URLs that the current module has not yet processed

Parameters:

  • max_depth (Fixnum) (defaults to: -1)

    maximum depth of the returned URLs (inclusive)

Returns:

  • (Array)

    URL list



# File 'lib/news_crawler/crawler_module.rb', line 45

def find_unprocessed(max_depth = -1)
  URLQueue.find_all(self.class.name, URLQueue::UNPROCESSED, max_depth)
end

#load_yaml(key, value) ⇒ Object?

Load the YAML-serialized object stored under the given key

Parameters:

  • key (String)

Returns:

  • (Object, nil)


# File 'lib/news_crawler/crawler_module.rb', line 86

def load_yaml(key, value)
  YAMLStor.get(self.class.name, key, value)
end

#mark_all_as_unprocessed ⇒ Object

Mark all URLs as unprocessed for the current module



# File 'lib/news_crawler/crawler_module.rb', line 72

def mark_all_as_unprocessed
  URLQueue.mark_all(self.class.name, URLQueue::UNPROCESSED)
end

#mark_processed(url) ⇒ Object

Mark the given URL as processed for the current module

Parameters:

  • url (String)


# File 'lib/news_crawler/crawler_module.rb', line 32

def mark_processed(url)
  URLQueue.mark(self.class.name, url, URLQueue::PROCESSED)
end

#mark_unprocessed(url) ⇒ Object

Mark the given URL as unprocessed for the current module

Parameters:

  • url (String)


# File 'lib/news_crawler/crawler_module.rb', line 38

def mark_unprocessed(url)
  URLQueue.mark(self.class.name, url, URLQueue::UNPROCESSED)
end

#next_unprocessed(max_depth = -1) ⇒ String?

Atomically get the next unprocessed URL and mark it as processing

Parameters:

  • max_depth (Fixnum) (defaults to: -1)

    maximum depth of the returned URL (inclusive)

Returns:

  • (String, nil)

    URL, or nil if no unprocessed URL exists



# File 'lib/news_crawler/crawler_module.rb', line 68

def next_unprocessed(max_depth = -1)
  URLQueue.next_unprocessed(self.class.name, max_depth)
end
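
Because the claim is atomic, several workers can presumably pull from the same queue without processing a URL twice. The sketch below shows that pattern under stated assumptions: InMemoryQueue is a hypothetical stand-in, and its Mutex takes the place of whatever atomic operation the real storage backend provides.

```ruby
# Hypothetical in-memory queue, not the library's URLQueue.
class InMemoryQueue
  def initialize(urls)
    @states = urls.to_h { |url| [url, 'unprocessed'] }
    @lock = Mutex.new
  end

  # Find one unprocessed URL and mark it as processing in one locked step,
  # so no two callers can claim the same URL.
  def next_unprocessed
    @lock.synchronize do
      url, = @states.find { |_, state| state == 'unprocessed' }
      @states[url] = 'processing' if url
      url
    end
  end

  def mark_processed(url)
    @lock.synchronize { @states[url] = 'processed' }
  end

  def count(state)
    @lock.synchronize { @states.count { |_, s| s == state } }
  end
end

queue = InMemoryQueue.new((1..20).map { |i| "http://example.com/#{i}" })
claimed = Queue.new # thread-safe collector for claimed URLs

workers = 4.times.map do
  Thread.new do
    # nil signals that nothing is left to claim
    while (url = queue.next_unprocessed)
      claimed << url
      queue.mark_processed(url)
    end
  end
end
workers.each(&:join)

urls = Array.new(claimed.size) { claimed.pop }
puts urls.size                # prints 20
puts urls.uniq.size           # prints 20, so no URL was claimed twice
puts queue.count('processed') # prints 20
```

A worker that fails partway through would leave its URL in the processing state; in this module, mark_unprocessed looks like the way to return such a URL to the queue.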

#save_yaml(key, value) ⇒ Object

Serialize an object to YAML and save it, overwriting any existing value stored under the key

Parameters:

  • key (String)
  • value (Object)


# File 'lib/news_crawler/crawler_module.rb', line 79

def save_yaml(key, value)
  YAMLStor.add(self.class.name, key, value)
end
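
The save and load round trip can be sketched with Ruby's standard yaml library. This is an illustration under assumptions, not YAMLStor itself: InMemoryYAMLStore is hypothetical, its get takes only a module name and a key, and returning nil for a missing key is inferred from the (Object, nil) return type above.

```ruby
require 'yaml'

# Hypothetical stand-in for YAMLStor: values are kept as YAML text,
# namespaced by module name and key.
class InMemoryYAMLStore
  def initialize
    @docs = {}
  end

  # Serialize the value and store it, replacing any earlier value for the key.
  def add(mod_name, key, value)
    @docs[[mod_name, key]] = YAML.dump(value)
  end

  # Deserialize the stored value; assumed to return nil for an unknown key.
  def get(mod_name, key)
    text = @docs[[mod_name, key]]
    text && YAML.load(text)
  end
end

store = InMemoryYAMLStore.new
store.add('StructureAnalysis', 'example.com', { 'type' => 'list', 'score' => 1 })
store.add('StructureAnalysis', 'example.com', { 'type' => 'article', 'score' => 3 })

p store.get('StructureAnalysis', 'example.com') # the second hash, since add overwrites
p store.get('StructureAnalysis', 'missing')     # nil
```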