Class: RightScraper::Retrievers::CheckoutBase

Inherits:

Object
Base
RightScraper::Retrievers::CheckoutBase

Defined in: lib/right_scraper/retrievers/checkout_base.rb

Overview

Base class for retrievers that perform version control operations (CVS, SVN, etc.). Subclasses need only implement Retrievers::Base#available? and #do_checkout. To support incremental operation, they must also implement #exists? and #do_update, in addition to Retrievers::Base#ignorable_paths.
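The division of labor described above can be illustrated with a minimal sketch. It assumes a stand-in base class (`SketchCheckoutBase`) so that it runs without the gem; `MarkerRetriever` and its marker file are hypothetical and are not part of RightScraper.

```ruby
require 'fileutils'
require 'tmpdir'

# Stand-in for RightScraper::Retrievers::CheckoutBase (an assumption made so
# the sketch runs without the gem); only the overridable hooks are modeled.
class SketchCheckoutBase
  def initialize(repo_dir)
    @repo_dir = repo_dir
  end

  # Defaults mirror the base class: no checkout exists, hooks unimplemented.
  def exists?
    false
  end

  def do_checkout
    raise NotImplementedError
  end

  def do_update
    raise NotImplementedError
  end
end

# Hypothetical retriever that "checks out" by writing a marker file.
class MarkerRetriever < SketchCheckoutBase
  MARKER = '.marker'

  # Incremental updating is possible only when a checkout is already on disk.
  def exists?
    File.exist?(File.join(@repo_dir, MARKER))
  end

  # Full checkout: create the directory and record an initial revision.
  def do_checkout
    FileUtils.mkdir_p(@repo_dir)
    File.write(File.join(@repo_dir, MARKER), 'rev-1')
    true
  end

  # Incremental update: overwrite the recorded revision in place.
  def do_update
    File.write(File.join(@repo_dir, MARKER), 'rev-2')
    true
  end
end
```

A subclass that implements only #do_checkout would still work with the stand-in, but #exists? would keep returning false and every retrieval would be a full checkout.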

Direct Known Subclasses

Git, Svn

Instance Attribute Summary

Attributes inherited from Base

#logger, #max_bytes, #max_seconds, #repo_dir, #repository

Instance Method Summary

#do_checkout ⇒ TrueClass

Perform a de novo full checkout of the repository.

#do_update ⇒ TrueClass

Perform an incremental update of the checkout.

#do_update_tag ⇒ TrueClass

Updates the tag of the repository associated with this retriever to refer to the HEAD commit (SHA) on disk after retrieval.

#exists? ⇒ Boolean

Return true if a checkout exists.

#remote_differs? ⇒ TrueClass|FalseClass

Determines, if possible, whether the remote SHA/tag/branch referenced by the repository differs from what appears on disk.

#retrieve ⇒ Object

Attempts an incremental update and falls back to a clean checkout of the repository.

#size_limit_exceeded? ⇒ TrueClass|FalseClass

Determines whether the total size of files in repo_dir exceeds the size limit.

Methods inherited from Base

#available?, #ignorable_paths, #initialize, repo_dir

Constructor Details

This class inherits a constructor from RightScraper::Retrievers::Base

Instance Method Details

#do_checkout ⇒ `TrueClass`

Perform a de novo full checkout of the repository. Subclasses must override this to do anything useful.

Returns:

(TrueClass) —

always true

Raises:

(NotImplementedError)




# File 'lib/right_scraper/retrievers/checkout_base.rb', line 157

def do_checkout
  raise NotImplementedError
end

#do_update ⇒ `TrueClass`

Perform an incremental update of the checkout. Subclasses that want to handle incremental updating need to override this.

Returns:

(TrueClass) —

always true

Raises:

(NotImplementedError)




# File 'lib/right_scraper/retrievers/checkout_base.rb', line 165

def do_update
  raise NotImplementedError
end

#do_update_tag ⇒ `TrueClass`

Updates the tag of the repository associated with this retriever to refer to the HEAD commit (SHA) on disk after retrieval.

Returns:

(TrueClass) —

always true

Raises:

(NotImplementedError)




# File 'lib/right_scraper/retrievers/checkout_base.rb', line 173

def do_update_tag
  raise NotImplementedError
end

#exists? ⇒ `Boolean`

Return true if a checkout exists.

Returns:

(Boolean) —

true if the checkout already exists (and thus incremental updating can occur)




# File 'lib/right_scraper/retrievers/checkout_base.rb', line 117

def exists?
  false
end

#remote_differs? ⇒ `TrueClass|FalseClass`

Determines, if possible, whether the remote SHA/tag/branch referenced by the repository differs from what appears on disk. Not all retrievers have this capability; those that do not should return true to indicate that the remote has changed.

Returns:

(TrueClass|FalseClass) —

true if changed




# File 'lib/right_scraper/retrievers/checkout_base.rb', line 127

def remote_differs?
  true
end
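As a rough illustration of how a retriever with this capability might decide, the sketch below compares a SHA found on disk with the requested reference. The function name and both parameters are hypothetical and are not part of RightScraper's API. The 7-character abbreviation follows the `[0..6]` slice used in #retrieve.

```ruby
# Hypothetical comparison; local_head_sha and requested_ref are illustrative
# names, not RightScraper API.
def sketch_remote_differs?(local_head_sha, requested_ref)
  # When a comparison is impossible, report a difference so that retrieval
  # proceeds (the same default as the base class).
  return true if local_head_sha.to_s.empty? || requested_ref.to_s.empty?

  # Treat the full SHA or its 7-character abbreviation as a match; anything
  # else (for example a branch name) is conservatively reported as changed.
  ![local_head_sha, local_head_sha[0..6]].include?(requested_ref)
end
```

Returning true in the ambiguous cases costs at most an unnecessary update, whereas a wrong false would skip a needed retrieval.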

#retrieve ⇒ `Object`

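Attempts an incremental update and falls back to a clean checkout of the repository. Returns false when the update is skipped because the HEAD commit on disk already matches the remote repository reference, and true otherwise.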

Raises:

(RetrieverError)

# File 'lib/right_scraper/retrievers/checkout_base.rb', line 39

def retrieve
  raise RetrieverError.new("retriever is unavailable") unless available?
  updated = false
  explanation = ''
  if exists?
    @logger.operation(:updating) do
      # a retriever may be able to determine that the repo directory is
      # already pointing to the same commit as the revision. in that case
      # we can return quickly.
      if remote_differs?
        # there is no point in updating and failing the size check when the
        # directory on disk already exceeds size limit; fall back to a clean
        # checkout in hopes that the latest revision corrects the issue.
        if size_limit_exceeded?
          explanation = 'switching to checkout due to existing directory exceeding size limit'
        else
          # attempt update.
          begin
            do_update
            updated = true
          rescue ::RightScraper::Processes::Shell::LimitError
            # update exceeded a limitation; requires user intervention
            raise
          rescue Exception => e
            # retry with clean checkout after discarding repo dir.
            explanation = 'switching to checkout after unsuccessful update'
          end
        end
      else
        # no retrieval needed but warn exactly why we didn't do full
        # checkout to avoid being challenged about it.
        repo_ref = @repository.tag
        do_update_tag
        full_head_ref = @repository.tag
        abbreviated_head_ref = full_head_ref[0..6]
        if repo_ref == full_head_ref || repo_ref == abbreviated_head_ref
          detail = abbreviated_head_ref
        else
          detail = "#{repo_ref} = #{abbreviated_head_ref}"
        end
        message =
          "Skipped updating local directory due to the HEAD commit SHA " +
          "on local matching the remote repository reference (#{detail})."
        @logger.note_warning(message)
        return false
      end
    end
  end

  # Clean checkout only if not updated.
  unless updated
    @logger.operation(:checkout, explanation) do
      # remove any full or partial directory before attempting a clean
      # checkout in case repo_dir is in a bad state.
      if exists?
        ::FileUtils.remove_entry_secure(@repo_dir)
      end
      ::FileUtils.mkdir_p(@repo_dir)
      begin
        do_checkout
      rescue Exception
        # clean checkout failed; repo directory is in an undetermined
        # state and must be deleted to prevent a future update attempt.
        if exists?
          ::FileUtils.remove_entry_secure(@repo_dir) rescue nil
        end
        raise
      end
    end
  end
  true
end
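The branching in #retrieve can be condensed into a small decision function. This is only a sketch for reading the listing above: `retrieve_plan` and its keyword arguments are hypothetical names, and the re-raised `LimitError` case and the failure of the availability check are deliberately left out.

```ruby
# Hypothetical summary of the branches in #retrieve; returns the action taken.
# Each keyword stands in for the result of the corresponding call.
def retrieve_plan(exists:, remote_differs:, size_limit_exceeded:, update_succeeds:)
  # No checkout on disk: nothing to update, so do a clean checkout.
  return :checkout unless exists
  # Local HEAD already matches the requested reference: skip (retrieve returns false).
  return :skip unless remote_differs
  # Directory already over the limit: updating would fail the size check anyway.
  return :checkout if size_limit_exceeded
  # Otherwise try the update and fall back to a clean checkout on failure.
  update_succeeds ? :update : :checkout
end

retrieve_plan(exists: true, remote_differs: false,
              size_limit_exceeded: false, update_succeeds: true) # => :skip
```

Only the `:skip` outcome corresponds to a false return value from #retrieve; both `:update` and `:checkout` end with true.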

#size_limit_exceeded? ⇒ `TrueClass|FalseClass`

Determines whether the total size of files in repo_dir exceeds the size limit (max_bytes).


Returns:

(TrueClass|FalseClass) —

true if size limit exceeded

# File 'lib/right_scraper/retrievers/checkout_base.rb', line 135

def size_limit_exceeded?
  if @max_bytes
    # note that Dir.glob ignores hidden directories (e.g. ".git") so the
    # size total correctly excludes those hidden contents that are not to
    # be uploaded after scrape. this may cause the on-disk directory size
    # to far exceed the upload size.
    globbie = ::File.join(@repo_dir, '**/*')
    size = 0
    ::Dir.glob(globbie) do |f|
      size += ::File.stat(f).size rescue 0 if ::File.file?(f)
      break if size > @max_bytes
    end
    size > @max_bytes
  else
    false
  end
end
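The effect of the hidden-directory exclusion noted in the comments above can be seen with a standalone sketch. `visible_bytes_exceed?` is a hypothetical helper that takes the directory and limit as arguments instead of reading instance variables; the temporary directory and file names are invented for the demonstration.

```ruby
require 'fileutils'
require 'tmpdir'

# Sums the sizes of regular files matched by '**/*' under dir and stops as
# soon as the limit is passed. Without File::FNM_DOTMATCH the glob skips
# hidden entries such as ".git", so their contents are not counted.
def visible_bytes_exceed?(dir, max_bytes)
  return false unless max_bytes

  size = 0
  Dir.glob(File.join(dir, '**/*')) do |f|
    size += File.size(f) if File.file?(f)
    break if size > max_bytes
  end
  size > max_bytes
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, '.git'))
  File.write(File.join(dir, '.git', 'pack'), 'x' * 100) # hidden, not counted
  File.write(File.join(dir, 'README'), 'y' * 10)        # visible, counted
  puts visible_bytes_exceed?(dir, 50) # prints "false": only 10 visible bytes
  puts visible_bytes_exceed?(dir, 5)  # prints "true"
end
```

This is why the on-disk size of a checkout can be far larger than the total that the size check reports.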
