11th Quattor Workshop - CERN - 16-18/3/2011
Quattor Status Report - M. Jouvin
See Slides.
Core Tools
Quattor Configuration Modules - L. Munoz Meijias
Recent addition
ncm-ofed (from Stijn): Infiniband configuration, very preliminary
Obsolete/suspect modules
- ncm-http: used only at CERN
- ncm-tomcat: used only at CERN
- ncm-tomcat2: who uses it?
- ncm-rproxy: replaced by ncm-squid?
- ncm-smartd and ncm-smartd5: same schema, what is the difference?
Component quality: testing is tedious, old bugs often come back…
- CAF is a critical tool to help with component reliability
- Can run commands, update/write files with traceability and a real test suite
- LC status: still maintained, private copy in Quattor
- LC::File should not be used directly: use CAF::FileWriter and CAF::FileEditor instead
- LC::Process no longer expands to a subshell: a new version will allow splitting the command on whitespace
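The "no subshell, split on whitespace" behaviour can be illustrated outside Perl. The following is a minimal Python sketch, assuming the intent is to turn a command string into an argument list and execute it directly so that shell metacharacters are never interpreted. `run_no_shell` is a hypothetical helper for illustration, not part of LC or CAF.

```python
import shlex
import subprocess
import sys


def run_no_shell(command):
    """Split a command string into arguments and run it without a shell.

    Hypothetical helper: illustrates the idea only, not the LC/CAF API.
    """
    argv = shlex.split(command)  # whitespace split that honours quotes
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # '*' and ';' reach the program as literal arguments:
    # no globbing happens and no second command is started.
    demo = shlex.quote(sys.executable) + ' -c "import sys; print(sys.argv[1:])" "*" ";"'
    print(run_no_shell(demo))
```

Components that relied on backquotes for globbing or pipes would have to do that work explicitly, which is the price of the added safety.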
ncm-accounts: rewritten twice in the last 2 years, hope v6 will fix all problems for good
- Very fast with large number of accounts
- It correctly handles creation of user’s home directories
ncm-network: lots of complaints, ugly code, needs rewriting; must review what the component should really do
- Define IP config for network cards
- Define bridge devices: required for virtualization support
- Tuning of device parameters
- Configuration of channel bonding: kernel module manipulations should be done with ncm-modprobe
- Must work on diskless servers: no more restart of the network just for testing
- TCP-parameters should be controlled by ncm-sysctl
- IPv6 support: schema extension needed
- Loic may try to look at it
ncm-sudo: CERN needs to be able to manage sudoers.ldap
ncm-useraccess: proposal to remove PAM-based ACL support
- Mistake to have put it here, use ncm-pam instead
ncm-authconfig: rewritten with LDAP support
ncm-modprobe: support for SL4 to SL6
- SL3 support removed
ncm-spma v2: new incompatible schema, 2-phase upgrade needed
ncm-chkconfig: problem with removal of the ability to specify both on and off
- Need ability to say “set service on to runlevel “,4,5 but off at other levels”
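One way to picture the requested behaviour is a per-service set of "on" runlevels, with the "off" set derived as the complement. The sketch below is an assumption-laden illustration: it assumes the standard SysV runlevels 0 to 6 and the `chkconfig --level <levels> <service> on|off` form, and `chkconfig_commands` is a hypothetical helper, not the component's code or schema.

```python
ALL_RUNLEVELS = "0123456"  # standard SysV runlevels (assumption)


def chkconfig_commands(service, on_levels):
    """Return chkconfig argument lists that turn a service on at the given
    runlevels and off at every other one (hypothetical helper)."""
    on = "".join(sorted(set(on_levels)))
    invalid = set(on) - set(ALL_RUNLEVELS)
    if invalid:
        raise ValueError(f"invalid runlevels: {sorted(invalid)}")
    # Everything not explicitly "on" is switched off.
    off = "".join(level for level in ALL_RUNLEVELS if level not in on)
    commands = []
    if on:
        commands.append(["chkconfig", "--level", on, service, "on"])
    if off:
        commands.append(["chkconfig", "--level", off, service, "off"])
    return commands
```

For example, `chkconfig_commands("sshd", "45")` yields one "on" command for levels 45 and one "off" command for levels 01236; the service name and levels here are illustrative only.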
CCM: no changes since Feb. 2009, MS patches still not there?
- Still no support for local profiles required for easier testing
ncm-ncd: version 1.3 by MS allows subclassing
- Some components won’t work with 1.2 in the future
- Should try to enable -t in Perl to get warnings about dangerous components
cdp-listend and ncm-cdispd should drop privileges when not needed
RHEL6 support: another platform to maintain… introduces Perl 5.12
- Major language changes but backward compatible
- Use stricter modes in Perl 5.12 to identify improvements needed to existing components
- Don’t jump too quickly into new, incompatible features
- Core libraries (CAF, LC) are not affected by the changes
Some actions needed
- Identify obsolete components and move them to another location (delete them?)
- Identify and fix components using backquotes to spawn commands instead of CAF::Process
- Do a Quattor component release? or official/recommended list of components?
- Can the new build tools help with this?
- What process?
Luis will no longer be able to (significantly) contribute as he’s leaving CERN…
Aquilon and other MS Developments Status - N. Williams
Main developments: integration with ESX through QRD
Profile signature: decided it was not the way to go; profiles are just encrypted over the wire using the Krb host keytab
ncm-network: MS would like to be involved in rewrite, ready to contribute
ncm-modprobe: difficulties with the latest version because of the removal of shell expansion
RHEL6: found various issues, some of them not identified at CERN (in particular AII-related ones)
- Goal: support ready in a few months, fixing problems as they are discovered
Aquilon: will add support for “clusters” in a way similar to SCDB to help with use of QWG templates.
Will start looking at supporting Solaris 11 in the future.
- Probably not before autumn
May also look at feeding their internal Windows management system from Quattor.
Pan Compiler Status - C. Loomis
v8.5-7:
- Warnings about all v9 deprecated features
- Fix for NS object templates
- Fix for panc on Windows
- Introduction of “prefix” statement to simplify paths in templates
- In effect from where it is declared to the end of the template; several are possible in the same template but this is not recommended
v9
- v9.0: available soon, development version
- v9.2: early summer, production version
v9 main new features for the first versions
- Fixed generation of annotation output for better handling of namespaces
- Fix command line script
- New ant task and maven plugin
v9 roadmap
- Investigate use of Clojure (Lisp-like language on the JVM): better implementation of memory management in panc, agent model similar to panc task management, memoization of file system stats…
- May allow simplifying/streamlining the panc code
- Focus on code maintainability rather than new features (none requested)
Support
- v8.4.7 is the last v8 version
- v8.2.x is now unsupported, please upgrade
- v9.x: all new developments in these releases
Requests from the discussion
- debug() outside DML (Nick)
- Produce a “map” of the includes without executing (compiling) them (Nick)
- Difficult when using loadpath
- Clojure may help
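The requested include "map" could be approximated by scanning template sources without compiling them. The following Python sketch assumes templates use literal include statements of the form `include 'a/b';` or `include { 'a/b' };`. It deliberately ignores computed includes and does not resolve names against a loadpath, which is where the difficulty noted above lies. `include_map` is a hypothetical helper, not a panc feature.

```python
import re

# Matches literal includes only: include 'a/b';  or  include { 'a/b' };
INCLUDE_RE = re.compile(
    r"""^\s*include\s*\{?\s*['"]([^'"]+)['"]\s*\}?\s*;""",
    re.MULTILINE,
)


def include_map(templates):
    """Map each template name to the literal includes found in its source.

    templates: dict of template name -> source text.
    Computed includes (DML blocks) are skipped because they cannot be
    resolved without executing the template.
    """
    return {name: INCLUDE_RE.findall(text) for name, text in templates.items()}
```

The result is only a lower bound on the real include graph, since any include decided at compile time is invisible to a static scan.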
QWG Templates
QWG Templates Update - M. Jouvin
See slides.
Nagios Templates - R. Starink
Almost done
- monitoring/nagios/config: server configuration
- RPM includes
- Apache configuration
Difference between generic and local configuration: most work done locally, still need to be merged
Pnp4Nagios: service tpl needs site modifications
Hierarchy of Nagios servers: optional
Grid monitoring: based on EGEE/EGI work
- Uses YAIM, not easy to integrate into existing QWG templates. More discussion required.
- Replace ncm-yaim by ncm-filecopy?
- Not generic
Summary
- No visible progress yet but some progress behind the scenes
- Merge into SVN requires time and dedication: would like to test results
- Grid monitoring to be done after completion of base work
LEMON - I. Fedorko
Exception sensor: runs on the monitored node to report abnormal conditions (based on local conditions and metrics collected by other sensors on the node) and optionally run a corrective action.
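The logic of such a sensor can be sketched as follows. This is a minimal Python illustration, assuming metrics are available locally as a name-to-value mapping; `check_exception` and its report format are hypothetical and are not LEMON's API.

```python
def check_exception(metrics, condition, corrective_action=None):
    """Evaluate one exception on locally collected metrics.

    metrics: dict of metric name -> latest value from other sensors.
    condition: callable returning True when the state is abnormal.
    corrective_action: optional callable run when the exception is raised.
    Returns a report dict (hypothetical format, not LEMON's).
    """
    raised = bool(condition(metrics))
    action_run = False
    if raised and corrective_action is not None:
        corrective_action(metrics)  # e.g. clean a full directory
        action_run = True
    return {"raised": raised, "action_run": action_run}
```

A caller would pass a condition such as `lambda m: m["tmp_used_pct"] > 90` together with a cleanup function; both names are illustrative.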
LEMON usage: CERN (several instances: one at CC + online farms), CNAF, MS
Backends
- Supported: Oracle and flat-files
- No alarming possible with flat-files: rely on PL/SQL (available for most DBs)
- Postgres/MySQL support should be possible to add as internally LEMON is DB vendor independent
LEMON usage at CERN: 11K entities (8K machines), 5+ Knodes with 150-200 metrics
- Performance monitoring
- Application monitoring
- Infrastructure metrics: temperature, power…
- 5 core sensors covering 60% of measurements
- 1 LEMON server (+1 for historical data)
Quattor management of LEMON
- LEMON agent configured with ncm-fmonagent
- LEMON client authenticated to the server via the SINDES host certificates
Current activities
- Federation of instances (worked with MS)
- Virtual machines monitoring: Quattor used to generate images (golden nodes) leading to several VM instances sharing the same SINDES certificate
- Also customized sensors for hypervisors
- New LEMON web for better scalability and improved flexibility (more dynamic operations)
Future
- Consolidation of monitoring activities at CERN IT, including experiment support
- New DB schema for increased scalability
- DB data export for data mining
- Exception classification for alarming
- Remote instances (monitoring of remote T0)
- Probably integrating with Nagios for collection
- Interfacing to Nagios: interested in ideas/experiences
- Interfacing with Windows monitoring
LEMON development team at CERN: 0.5 to 1.5 FTE
- Enough for maintenance for CERN usage
- Not enough to do major reengineering: a problem as the use cases have changed
- Consolidating monitoring activities (and manpower) is one direction
- Interacting with other tools is the other direction
CMS Experience with Quattor - J. A. Coarasa Perez
CMS online computing infrastructure requirements
- Complete autonomy from CERN IT and network
- Myrinet core network
- Redundant services (2 computing rooms 200 m apart), managed remotely from a central control room
- Fast configuration turnaround
- Ability to read from electronics at 100 kHz (100 GB/s)
- Computing power to do data selection (2 GB/s output): ~2500 cores
- Enough disk storage for 2 days: ~300 TB
- Spare computers for quick replacement in case of failure (~200)
Limited manpower: ~5 FTEs
Quattor is the core configuration management tool
- Computers installed through Quattor but not using AII
- Based on a (very) old Quattor version: 1.3
- panc 6
- RPM repositories hosted on NetApp filers with 6 SWrep servers
- Performance: 10 mn to compile all 2500 machines
- Time to deploy changes: ~10 mn (cdp notification used)
- Reinstallation of 1K machines in 1h 1/4
- Problem when SWrep servers were overloaded: spma timeout, no retry
Well defined template structure/layout (CMS specific)
Some CMS-specific developments
- Dropbox for RPMs to allow people without Quattor expertise to update their SW in a traceable/reproducible way
- Buggy components requiring workarounds (in particular for configuration removal)
- Copy of configuration files to individual computers: copyd
- Network configuration derived from DNS
- RPMs used to create users
- Web frontend to CDB configuration: template summarizer (parsing part of the profiles according to CMS way of configuring systems)
- Software Update Tools: allow updating an existing RPM on a group of computers (dropbox)
- Upgrade or downgrade
- Allow rolling back changes
- Allow checking the status of the update
- Perform checks on the RPM: already exists, not named properly…
- Cannot be used to add a new RPM to the configuration: need to request addition of a new RPM
- Only one person can run it at any time
- Ability to run commands on a set of computers selected based on the use of a given template (like a type)
Conclusions
- Quattor has been and is scalable to CMS needs
- Not always as flexible as we’d like…
Security
SINDES Update - J. Dudziec
Certificate authority + secured file store.
Several issues requiring major reengineering
- Only one SINDES user per system: no fine grain permissions for different users
- Impossible to delete a file in the file store
- Impossible to move a machine from one cluster to another one
- No support of subclusters
- Unattended installation of machines not really possible: SINDES must be notified first
Progress since last workshop
- Documentation on Quattor wiki
- Code in SF with Apache2 license: still some issues (currently GPL)
SINDES v2 overview
- Apache + SOAP
- Internal DB for SINDES-specific information, eg. SINDES items (file + permissions)
- Interaction with CDB through CDB2SQL to retrieve host information (eg. cluster/subcluster)
- Krb used in replacement for SINDES CA for accessing SINDES server
- Same client tool for users and hosts
- Major improvement of code maintainability: 300 LOC instead of 10K in v1
- Site-specific authorization modules to use whatever is appropriate at the site: CERN is using egroups, LANDB…
- Flexible item to host/cluster/subcluster mapping
Impact of SINDES CA removal: not necessarily more difficult to set up Krb than SINDES CA…
- SINDES CA adds a lot of complexity to SINDES code, not necessarily secure…
- May consider adding support for optional other authentication methods, like certificates
Unattended installation
- CERN: initial keytab download allowed if request coming from low ports and reverse DNS matches (no time window)
- MS: adding a time-window feature to Krb
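The CERN acceptance rule described above can be sketched in a few lines. This Python example is an illustration under stated assumptions: "low ports" is taken to mean privileged ports below 1024, and the reverse DNS name must equal the host name claimed in the request. `keytab_request_allowed` is a hypothetical function, not SINDES code.

```python
import socket


def keytab_request_allowed(client_ip, client_port, claimed_host,
                           reverse_lookup=socket.gethostbyaddr):
    """Accept an initial keytab request only if it comes from a privileged
    ("low") port and the reverse DNS name of the source address matches
    the claimed host name. No time window is checked."""
    if not 0 < client_port < 1024:
        return False
    try:
        resolved_name = reverse_lookup(client_ip)[0]
    except OSError:  # lookup failures are treated as a refusal
        return False
    return resolved_name.rstrip(".").lower() == claimed_host.rstrip(".").lower()
```

The `reverse_lookup` parameter exists only so the check can be exercised without real DNS; a time-window feature like the one MS is adding would be an additional condition on top of these two.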
Only client available in v1 is command line, other APIs under consideration for v2
- May change the implication of a GPL license
Quattor and CERN Security - L. Munoz Meijias
Applying SW updates
- CERN IT prepares snapshots of recommended versions every week, users pick one of them
- In case of a major security vulnerability/problem, CERN Security can make an update mandatory
Challenge of managing 250M accounts… in particular challenge of maintaining ACLs
- Unix groups are not an appropriate solution: difficult to maintain, no recursive groups
- LDAP + egroups used to implement the required features
- egroups: logical name for a set of accounts, recursive
- Quattor used to configure LDAP authorization
- Still an issue to configure non LDAP-aware services
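What "recursive" buys over Unix groups can be shown with a small membership expansion. The Python sketch below assumes groups are given as a mapping from group name to members, where a member that is itself a group name is a nested group. `expand_group` is a hypothetical helper and does not reflect the egroups or LDAP implementation.

```python
def expand_group(name, groups, _seen=None):
    """Return the set of accounts in a group, following nested groups.

    groups: dict of group name -> list of members; a member that is
    itself a key of `groups` is treated as a nested group.
    """
    seen = set() if _seen is None else _seen
    if name in seen:  # protect against membership cycles
        return set()
    seen.add(name)
    accounts = set()
    for member in groups.get(name, []):
        if member in groups:
            accounts |= expand_group(member, groups, seen)
        else:
            accounts.add(member)
    return accounts
```

With flat Unix groups the same result would require copying every account into every enclosing group and keeping the copies in sync by hand.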
Access to accounts: mainly through Krb (+PAM)
- SSH keys are a potential problem in case of computers compromised outside CERN, no way to know if they have been removed from all CERN computers
- Privileged accounts: no possibility to ssh out from a privileged account
- Need to restrict more accounts (apache, oracle)
- Would like to restrict the allowed origin of incoming connections to privileged accounts
Multi-factor authentication: no uniform way of doing it, work in progress
Logging of all commands and command arguments on sensitive systems using snoopy
- Intercepts calls to execv and sends what is executed to syslog
Monitoring with alarming of ~12 security-sensitive files per computer
- File permissions
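A permission check of this kind can be sketched as a comparison between the observed mode of each file and an expected value. The Python example below is only an illustration: the list of monitored files and their expected modes are not given in these notes, so they are passed in as parameters, and `permission_alarms` is a hypothetical helper rather than the sensor used at CERN.

```python
import os
import stat


def permission_alarms(expected_modes):
    """Compare the permission bits of each file with the expected mode.

    expected_modes: dict of path -> expected mode (e.g. 0o600).
    Returns a list of (path, problem) tuples; an empty list means no alarm.
    """
    alarms = []
    for path, expected in expected_modes.items():
        try:
            actual = stat.S_IMODE(os.stat(path).st_mode)
        except OSError as err:
            alarms.append((path, f"cannot stat: {err.strerror}"))
            continue
        if actual != expected:
            alarms.append((path, f"mode {actual:04o}, expected {expected:04o}"))
    return alarms
```

A missing file is reported as an alarm as well, on the assumption that the disappearance of a security-sensitive file is itself worth flagging.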
Development Process and Tools
End of current sprint
See wiki.
Maven-based Build Tools - C. Loomis
Tools available
- Build plugins, configuration for full configuration module
- Archetype to build config for a new configuration module
- Creates tarballs as well as RPMs
Documentation available.
Usage status
- pan
- pan-templates
- ncm-query and ncm-cdispd: prototype available
Making releases
- “Magic” to do a release with Maven is easy:
mvn clean release:prepare release:perform
- Builds, tags, uploads packages, makes “site”, increments version
- But requires some specific permissions: write access to nexus repository for packages, write access to SF website for “site” information
- Allow anyone from command line to make release?
- Use continuous integration server (eg. Hudson) for releases? Preferred solution… but it is an additional service (not available on SF, need to manage identities)
Continuous integration service: much more benefits than just the release making
- Not necessarily a big deal to set up the service
- Already have the StratusLab experience
- Is LAPP ready to do it (previously in charge of build infrastructure)?
More discussion tomorrow? Attempt to start a Hudson server?
Development Process Discussion
Generally good feeling for those who participated
- But not a very good feeling about EVO
- Let’s test the MS audio system for next standups
Meeting day and time: Thursday 2pm CET is not necessarily the best slot
- Start a Doodle to find another possible slot, preferably during the morning
- Wednesday may be a good choice, avoid Monday/Friday
Sprint duration: 1 month, smaller backlog
- Monthly meeting to close the sprint and have the user interaction slot
Configuration module management: how to know/update what is the recommended (stable) version to use?
- No great idea…
- Involves manual steps/work
- Must be sustainable…
- To be reviewed when we have converted all configuration modules to use Maven tools and have a continuous integration server in place.
Migration from SVN to Git: is it desirable? if yes, which steps?
- StratusLab feedback: required a bit of effort for SVN users but several benefits at the end
- Smaller, more specific repositories: smaller number of users with write access per repository
- External references are really external to the repository
- At SF, all users with write access in the project have write access to Git repositories by default but can be restricted per repository
- Short term steps: new projects (eg. SINDES) in Git, panc moved from SVN to Git
- Need a candidate with interaction between several developers: CCM-related things (ccm, ncm-cdispd, ncm-ncd, QuattorFS), AII and its plugins
RHEL6
CERN Experience
- Path to install Perl modules has changed: fixed by Luis for every Perl module in SVN
- Now installed in /usr/lib/perl, that applications are not supposed to use (not used by anybody)
- Using standard installation place makes RPM OS-version dependent
- Use /opt/quattor/lib/perl: upgrade ncm-ncd to look in both places
- rpmt-py broken by 2 changes in standard Python RPM bindings: fixed by Christos
- curl bug breaking SINDES and potentially anything using wget: certificate not looked for in the right place
- No workaround/fix yet
- No impact known yet on Quattor itself
- ncm-grub doesn’t update the kernel after a kernel upgrade and %post script in RPM is failing
- Christos will try to look at this
- Also a need to add kernel architecture to grub command
- SELinux: issue with rpmt-py, needs to be documented on Quattor wiki
- LDAP configuration is now in another place with another layout (previous information as a string now several tokens): ncm-authconfig has to be enhanced
- Difficult to split the existing string in SL6 tokens: site specific
- Rework the schema to have something more appropriate to handle different platforms?
- Luis will try to summarize the problem and possible solutions on quattor-devel
- Not clear if others are using LDAP for authentication…
- Time synchronization in virtual machines run on HyperV: need to run new ntpdate service during startup to force synchronization
- Initial time seems to be UTC…
- ncm-modprobe: modprobe configuration files changed their location, new attribute in the schema to specify this location
- Need to be handled by OS configuration templates
- ncm-symlinks: use of last version required
AUTH
- A few fixes to be committed
MS
- Problem with AII file system partitioning (caches not flushed or something like that): thinking of using standard Anaconda for this, driving it from Quattor config
- In the past, issues with Anaconda’s handling of file system partitioning
- ncm-pam issue, about to commit the fix
Discussion
RHEL6 raises the question of the potential changes that may be required for multiple-platform support
- MS wants to restart Solaris port
- Some components may be valid only on one platform: how to prevent them from being used on the wrong one?
- Bind the path in configuration to something that prevents its use
Conclusions - Next Workshop
Next workshop in Strasbourg.
- Best candidates look to be weeks 41 (10-14/10) and 42 (17-21/10)